Day Three:
Transform

180 min approx

Overview

Questions

  • What are Tidy data, why are they useful, and how to transform untidy data to tidy one?
  • How to select some varaibles/columns only?
  • How to filter rows that match certain conditions?
  • How to modify (even create) the content of a variable?
  • How to handle dates and time in R?
  • How to handle strings in R?

Lesson Objectives

To be able to

  • Use pivot_*, separate, unite function from the tidyr package in the Tidyverse to reshape data into tidy one.
  • Select/filter columns/rows of tibbles (i.e., dataframes).
  • Change content of variable programmatically, possibly ususing content from other variables.
  • convert textual date/time into date/time R objects
  • use simple regular expression to manage strings

Data shape

Tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Untidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Why tidy data

Illustrations from the Openscapes blog Tidy Data for reproducibility, efficiency, and collaboration by Julia Lowndes and Allison Horst

Tidy rules

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

Why untidy data

  • Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.

Example: tidyverse::billboard dataset.1

library(tidyverse)

billboard

Warning

  • information in column:
    • wk1-wk76 should be a single variable: the week.
    • cell values of wk1-wk76 should be a single variable: the rank.

Start Tidying - tidyr::pivot_longer

  • Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
library(tidyverse)

billboard |> 
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank"
  )

Important

  • tidyr::pivot_longer convert your data in “longer” fromat
  • cols: select which variable should be pivoting
  • names_to: define the column hosting the cols colnames
  • values_to: define the column hosting the cols values

Warning

Many possibly uninformative missing information!

Start Tidying - tidyr::pivot_longer

  • Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry, not analysis, easy.
library(tidyverse)

billboard |> 
  pivot_longer(
    cols = starts_with("wk"),
    names_to = "week",
    values_to = "rank",
    values_drop_na = TRUE
  )

Important

  • tidyr::pivot_longer convert your data in “longer” fromat
  • cols: select which variable should be pivoting
  • names_to: define the column hosting the cols colnames
  • values_to: define the column hosting the cols values
  • values_drop_na: decide if rows with missing information in values should be removed

Selectors 1

  • var1:var10: variables lying between var1 on the left and var10 on the right.

  • starts_with("a"): names that start with “a”.

  • ends_with("z"): names that end with “z”.

  • contains("b"): names that contain “b”.

  • matches("x.y"): names that match regular expression x.y. 2

  • num_range(x, 1:4): names following the pattern, x1, x2, …, x4.

  • all_of(vars)/any_of(vars): names stored in the character vector vars. all_of(vars) will error if the variables aren’t present; any_of(var) will match just the variables that exist.

  • everything(): all variables.

  • last_col(): furthest column on the right.

  • where(is.numeric): all variables where is.numeric() returns TRUE.

Tip

  • !selection: only variables that don’t match selection.

  • selection1 & selection2: only variables included in both selection1 and selection2.

  • selection1 | selection2: all variables that match either selection1 or selection2

Multiple variable in colnames

who2

Tip

In case of multiple variable in each colname, you can pivoting them maintaining the underling structure. This way you can separate them in a furhter second step…

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = "diagnosis_gender_age", 
    values_to = "count"
  )

Multiple variable in colnames

who2

Tip

In case of multiple variable in each colname, you can pivoting them maintaining the underling structure. This way you can separate them in a furhter second step usign tidyr::separate.

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = "diagnosis_gender_age", 
    values_to = "count"
  ) |> 
  separate(
    col = diagnosis_gender_age,
    into = c("diagnosis", "gender", "age"),
    sep = "_"
  )

Multiple variable in colnames

who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = "diagnosis_gender_age", 
    values_to = "count"
  ) |> 
  separate(
    col = diagnosis_gender_age,
    into = c("diagnosis", "gender", "age"),
    sep = "_"
  )
who2 |> 
  pivot_longer(
    cols = !(country:year),
    names_to = c("diagnosis", "gender", "age"), 
    names_sep = "_",
    values_to = "count"
  )

Tip

You can also separate colnames containing multiple variables, and matching a regular pattern, into multiple variable in a single step.

tidyr::pivot_wider

Image from Data Carpentry’s R for Social Scientists

Reverse pivot - tidyr::pivot_wider

Animation of tidyverse verbs by Garrick Aden-Buie

Data management

Select

Filter

Mutate

Data formats

Dates

Date-times

Strings - Regular Expressions

Homework

Posit’s RStudio Cloud Workspace

  • Project: Day-3
  • Instructions:
    • Go to: https://bit.ly/ubep-rws-website
    • The text is the Day-3 assessment under the tab “Summative Assessments”.
  • Script to complete on RStudio: solution.R

Acknowledgment

To create the current lesson, we explored, used, and adapted content from the following resources:

The slides are made using Posit’s Quarto open-source scientific and technical publishing system powered in R by Yihui Xie’s Kintr.

Additional resources

License

This work by Corrado Lanera, Ileana Baldi, and Dario Gregori is licensed under CC BY 4.0

References